# Predictability of Kullback-Leibler (KL) Divergence
## Introduction
Streamflow prediction in ungauged basins (PUB) has classically focused on minimizing one or more loss functions, typically squared error, NSE, or KGE, between observed daily streamflow and predictions generated by some type of model, whether an empirical regionalization, a statistical machine learning model, or a process-based rainfall-runoff model.
Regionalization and machine learning models for PUB rely on the existing streamflow monitoring network, and the performance of these models is tied to how well the network represents the ungauged space. This connection leads to the question of how the arrangement of streamflow monitoring stations within the network impacts the overall performance of PUB models, particularly in terms of expected prediction error across all ungauged locations. Furthermore, are there environmental signals, orthogonal to streamflow, that contain enough information to differentiate between network arrangements such that the prediction error over ungauged areas can be minimized?
A simple interpretation of the loss functions commonly used in the PUB literature might be: “how close are mean daily streamflow predictions to observed values?” A much simpler question to ask of observational data is: “will a given model outperform random guessing in the long run?” This binary question represents a starting point for approaching the optimal streamflow monitoring network problem. The justification for asking such a basic question is that an expectation of the uncertainty reduction over the unmonitored space provided by a given monitoring arrangement supports a discriminant function for comparing unique arrangements. This simple question can be tested on real data, in this case an ungauged space of over 1 million catchments and a set of over 1600 monitored catchments with which to train a model.
The binary prediction problem is followed by a regression problem, where the goal is to minimize the expectation of prediction error based on the Kullback-Leibler divergence \(D_{KL}\) (a surrogate loss function from the class of information discriminant measures, consistent with the exponential loss []).
## Problem Formulation
### Binary Prediction Problem
The first question, “is a given model better than random guessing in the long run?”, is formulated as a binary classification problem as follows:
The streamflow prediction model assumes that discharge at one location is equal to that at a different location on a unit-area basis, commonly referred to as an equal unit area runoff (UAR) model.
Given the equal UAR model, an observed (proxy) catchment is a potential model for predicting the distribution of UAR for any unobserved (target) location,
A proxy is “informative” to a target if the proxy UAR is closer to the posterior target UAR than the maximum uncertainty (uniform distribution) prior.
A proxy is “disinformative” to a target if the maximum uncertainty (uniform distribution) prior is closer to the posterior target UAR than the proxy UAR.
The “closeness” of distributions is characterized by three measures from the general class of f-divergences, namely the total variation distance (TVD), the Kullback-Leibler divergence (\(D_{KL}\)), and the earth mover’s distance (EMD), also known as the Wasserstein distance.
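As a concrete illustration, the three measures can be computed for a pair of discrete UAR distributions on a shared dictionary of states. This is a sketch only: the helper names and toy distributions below are assumptions for illustration, not taken from the analysis code.

```python
# Illustrative sketch of the three "closeness" measures for two pmfs
# p and q defined over the same dictionary of states.
import numpy as np
from scipy.stats import entropy, wasserstein_distance

def tvd(p, q):
    """Total variation distance: half the L1 distance between pmfs."""
    return 0.5 * np.abs(p - q).sum()

def dkl(p, q):
    """Kullback-Leibler divergence in bits; infinite if q_i = 0 where p_i > 0."""
    return entropy(p, q, base=2)

def emd(p, q, states):
    """Earth mover's (Wasserstein) distance between pmfs on a shared support."""
    return wasserstein_distance(states, states, u_weights=p, v_weights=q)

states = np.arange(4)                   # a toy 2-bit dictionary
p = np.array([0.4, 0.3, 0.2, 0.1])      # "observed" target UAR distribution
q = np.array([0.25, 0.25, 0.25, 0.25])  # maximum-uncertainty (uniform) model
print(tvd(p, q), dkl(p, q), emd(p, q, states))
```

Unlike TVD and \(D_{KL}\), the EMD depends on the distance between states, so it is sensitive to how the dictionary orders the quantized runoff values.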
For the Kullback-Leibler divergence \(D_{KL}\):
The (posterior) target UAR distribution is denoted by \(P\), and the proxy UAR distribution (model) is denoted by \(Q\).
A proxy model is informative for some target location if \(D_{KL}(P||\mathbb{U}) - D_{KL}(P||Q) > 0\)
The binary problem formulation is then:
The discriminant function maps the difference in the two divergences to a binary outcome corresponding to the sign of the resulting quantity \(Y = +1 \text{ if } D_{KL}(P||\mathbb{U}) - D_{KL}(P||Q) > 0 \text{ or } -1 \text{ otherwise}\)
The goal is to minimize the probability of incorrect predictions, defined by the (Bayes) error \(R_{\textbf{Bayes}}(\gamma, C) := \mathbb{P} \left(Y \neq \text{sign}(D_{KL}(\mathbf{P}||\mathbb{U}) - D_{KL}(\mathbf{P}||\mathbf{Q}))\right)\)
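The labelling rule above can be sketched as follows; this is a minimal illustration with assumed helper names and toy distributions, not the notebook's own implementation.

```python
# Sketch: label a proxy model Q as informative (+1) or disinformative (-1)
# for a target posterior P, relative to the uniform prior U.
import numpy as np

def dkl_bits(p, q):
    """KL divergence in bits, restricted to observed states (p_i > 0)."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def binary_label(p, q):
    u = np.full_like(p, 1.0 / len(p))   # maximum-uncertainty (uniform) prior
    return 1 if dkl_bits(p, u) - dkl_bits(p, q) > 0 else -1

p = np.array([0.5, 0.3, 0.15, 0.05])         # target (posterior) UAR pmf
q_good = np.array([0.45, 0.35, 0.15, 0.05])  # proxy close to the target
q_bad = np.array([0.05, 0.05, 0.45, 0.45])   # proxy far from the target
print(binary_label(p, q_good), binary_label(p, q_bad))  # prints: 1 -1
```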
### Regression Problem
The same problem setup applies for the regression problem, in which the discriminant function and the input signal quantization are optimized simultaneously to minimize the error in predicting the KL divergence from catchment attributes.
Instead of predicting a scalar measure that is a feature of a single location, the key difference in this step is that the target variable is a measure of the difference in runoff between pairs of locations. This approach asks whether the Kullback-Leibler divergence of the distributions of unit area runoff at two locations can be predicted from the attributes of both catchments (and their differences) using gradient boosted decision trees, which can also predict continuous targets, in this case \(D_{KL}\).
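A minimal sketch of assembling one pairwise input row is given below. The `proxy_`/`target_` prefixes mirror the feature naming used later in this notebook, while the `diff_` features, the toy attribute table, and the helper itself are illustrative assumptions.

```python
# Sketch: build a pairwise feature row from the attributes of a proxy and a
# target catchment, plus their differences. Toy values are illustrative.
import pandas as pd

attrs = pd.DataFrame({
    'official_id': ['A', 'B'],
    'drainage_area_km2': [120.0, 340.0],
    'prcp': [900.0, 1400.0],
}).set_index('official_id')

def pairwise_features(proxy_id, target_id):
    proxy, target = attrs.loc[proxy_id], attrs.loc[target_id]
    row = {f'proxy_{c}': proxy[c] for c in attrs.columns}
    row.update({f'target_{c}': target[c] for c in attrs.columns})
    row.update({f'diff_{c}': target[c] - proxy[c] for c in attrs.columns})
    return row

print(pairwise_features('A', 'B'))
```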
…
## Data Import and Model Setup
import os
import pandas as pd
import numpy as np
from time import time
from bokeh.plotting import figure, show
from bokeh.layouts import gridplot, row, column
from bokeh.models import ColumnDataSource
from bokeh.io import output_notebook
from bokeh.palettes import Sunset10, Vibrant7, Category20
import xgboost as xgb
xgb.set_config(verbosity=2)  # set_config applies globally; config_context is a context manager
from sklearn.metrics import (
    root_mean_squared_error,
    mean_absolute_error,
    roc_auc_score,
    accuracy_score,
    confusion_matrix,
)
import data_processing_functions as dpf
from scipy.stats import linregress
output_notebook()
BASE_DIR = os.getcwd()
# load the catchment characteristics
fname = 'BCUB_watershed_attributes_updated.csv'
attr_df = pd.read_csv(os.path.join('data', fname))
attr_df.columns = [c.lower() for c in attr_df.columns]
station_ids = attr_df['official_id'].values
print(f'There are {len(station_ids)} monitored basins in the attribute set.')
There are 1609 monitored basins in the attribute set.
### Load pairwise attribute comparisons
Load a few rows from one of the pairwise data files. These contain divergence measures computed on concurrent and non-concurrent time series at pairs of monitored locations.
# open an example pairwise results file
input_folder = os.path.join(
    BASE_DIR, "data", "processed_divergence_inputs",
)
pairs_files = os.listdir(input_folder)
test_df = pd.read_csv(os.path.join(input_folder, pairs_files[0]), nrows=1000)
kld_columns = [c for c in test_df.columns if 'dkl' in c]
kld_columns
['dkl_concurrent_uniform',
'dkl_concurrent_post_-5R',
'dkl_concurrent_post_-4R',
'dkl_concurrent_post_-3R',
'dkl_concurrent_post_-2R',
'dkl_concurrent_post_-1R',
'dkl_concurrent_post_0R',
'dkl_concurrent_post_1R',
'dkl_concurrent_post_2R',
'dkl_concurrent_post_3R',
'dkl_concurrent_post_4R',
'dkl_concurrent_post_5R',
'dkl_concurrent_post_6R',
'dkl_concurrent_post_7R',
'dkl_concurrent_post_8R',
'dkl_concurrent_post_9R',
'dkl_concurrent_post_10R',
'dkl_nonconcurrent_uniform',
'dkl_nonconcurrent_post_-5R',
'dkl_nonconcurrent_post_-4R',
'dkl_nonconcurrent_post_-3R',
'dkl_nonconcurrent_post_-2R',
'dkl_nonconcurrent_post_-1R',
'dkl_nonconcurrent_post_0R',
'dkl_nonconcurrent_post_1R',
'dkl_nonconcurrent_post_2R',
'dkl_nonconcurrent_post_3R',
'dkl_nonconcurrent_post_4R',
'dkl_nonconcurrent_post_5R',
'dkl_nonconcurrent_post_6R',
'dkl_nonconcurrent_post_7R',
'dkl_nonconcurrent_post_8R',
'dkl_nonconcurrent_post_9R',
'dkl_nonconcurrent_post_10R']
### Define attribute groupings
terrain = ['drainage_area_km2', 'elevation_m', 'slope_deg', 'aspect_deg'] #'gravelius', 'perimeter',
land_cover = [
    'land_use_forest_frac_2010', 'land_use_grass_frac_2010', 'land_use_wetland_frac_2010', 'land_use_water_frac_2010',
    'land_use_urban_frac_2010', 'land_use_shrubs_frac_2010', 'land_use_crops_frac_2010', 'land_use_snow_ice_frac_2010']
soil = ['logk_ice_x100', 'porosity_x100']
climate = ['prcp', 'srad', 'swe', 'tmax', 'tmin', 'vp', 'high_prcp_freq', 'high_prcp_duration', 'low_prcp_freq', 'low_prcp_duration']
all_attributes = terrain + land_cover + soil + climate
len(all_attributes)
24
### Set trial parameters
# define the amount of data to set aside for final testing
holdout_pct = 0.10
nfolds = 5
n_boost_rounds = 2500
n_optimization_rounds = 20
priors_to_test = [-2, -1, 0, 1, 2]
#define if testing concurrent or nonconcurrent data
concurrent = 'concurrent'
# partial counts refer to the test where observations were assigned
# a uniform distribution to approximate error and allow fractional
# observations in state space
partial_counts = False
# the input data file has an associated revision date
revision_date = '20240812'
all_test_results = {}
attribute_set_names = ['climate', '+land_cover', '+terrain', '+soil']
attr_df.columns
Index(['region', 'official_id', 'drainage_area_km2', 'centroid_lon_deg_e',
'centroid_lat_deg_n', 'logk_ice_x100', 'porosity_x100',
'land_use_forest_frac_2010', 'land_use_shrubs_frac_2010',
'land_use_grass_frac_2010', 'land_use_wetland_frac_2010',
'land_use_crops_frac_2010', 'land_use_urban_frac_2010',
'land_use_water_frac_2010', 'land_use_snow_ice_frac_2010',
'lulc_check_2010', 'land_use_forest_frac_2015',
'land_use_shrubs_frac_2015', 'land_use_grass_frac_2015',
'land_use_wetland_frac_2015', 'land_use_crops_frac_2015',
'land_use_urban_frac_2015', 'land_use_water_frac_2015',
'land_use_snow_ice_frac_2015', 'lulc_check_2015',
'land_use_forest_frac_2020', 'land_use_shrubs_frac_2020',
'land_use_grass_frac_2020', 'land_use_wetland_frac_2020',
'land_use_crops_frac_2020', 'land_use_urban_frac_2020',
'land_use_water_frac_2020', 'land_use_snow_ice_frac_2020',
'lulc_check_2020', 'slope_deg', 'aspect_deg', 'median_el', 'mean_el',
'max_el', 'min_el', 'elevation_m', 'prcp', 'tmin', 'tmax', 'vp', 'swe',
'srad', 'low_prcp_duration', 'low_prcp_freq', 'high_prcp_duration',
'high_prcp_freq', 'mean_runoff'],
dtype='object')
### Check the binary classification balance
The figure below shows how the binary classification balance changes as a function of both the dictionary size and the prior. Coloured lines in the plot represent different dictionary sizes, and the log x-axis scale reflects the range of priors tested in the form of pseudo counts applied to \(Q\), the simulated UAR distribution. The y-axis represents the fraction of True Y values (the target variable), described above as proxy-target (observed-simulated) pairs where the simulated distribution is “closer to” the observed (posterior) compared to the uniform distribution.
The balance of the target variable is sensitive to both the choice of prior and the dictionary size over the range of values tested. Changing the dictionary size from 4 to 12 bits (\(2^4\) to \(2^{12}\) states) changes the proportion of models that are “better than random guessing” by about 3-8% for any given prior, while changing the prior from \(10^{-5}\) to \(10^5\) changes the same proportion by between 10 and 25% for a given dictionary size. Smaller priors correspond to larger penalties for underspecified models, so the prior controls the “selectivity” of the discriminant function, i.e. the number of “informative” models. The smallest prior (\(10^{-5}\) pseudo-counts) at 12-bit quantization yields the most selective setting, where 75% of the pairings are rejected, namely those whose KL divergence \(D_{KL}(P||Q)\) exceeds \(D_{KL}(P||\mathbb{U})\).
The point is to create contrast within the sample set such that the “usefulness” of models, and the predictors thereof, can be most effectively distinguished by the gradient boosting procedure. “Real” contrast in this context means contrast representative of information about processes governing long-term runoff contained in the signal, as opposed to vestigial effects of data pre-processing, measurement uncertainty, or the model itself (epistemic and aleatoric uncertainty).
A discriminant function has the effect of attenuating the decision space of the sensor network arrangement problem, because fewer True values means fewer viable monitoring locations.
result_dict = {}
for bitrate in range(4, 13):
    if bitrate in [5, 7]:
        continue
    fname = f"KL_results_{bitrate}bits_{revision_date}.csv"
    if partial_counts:
        fname = f"KL_results_{bitrate}bits_{revision_date}_partial_counts.csv"
    input_data_fpath = os.path.join(input_folder, fname)
    df = pd.read_csv(input_data_fpath, low_memory=False)
    result_dict[bitrate] = df
class_balance_df = pd.DataFrame()
for bitrate in range(4, 13):
    if bitrate in [5, 7]:
        continue
    df = result_dict[bitrate]
    posterior_columns = [c for c in df.columns if c.startswith(f'dkl_{concurrent}_post')]
    balances = []
    for target_col in posterior_columns:
        # if DKL(P||Q) < DKL(P||U), then the model is a "better compressor"
        # of the target signal than a uniform distribution
        df['binary_target'] = df[target_col] < df[f'dkl_{concurrent}_uniform']
        # fraction of True labels; mean() is robust even when one class is absent
        balances.append(df['binary_target'].mean())
    class_balance_df[bitrate] = balances
class_balance_df.index = [10**(int(c.split('_')[-1].split('R')[0])) for c in posterior_columns]
bal_fig = figure(width=600, height=400, x_axis_type='log')
n = 0
for c in class_balance_df.columns:
    bal_fig.line(class_balance_df.index, class_balance_df[c], color=Category20[8][n],
                 line_width=2, legend_label=f'{c}bits')
    n += 1
bal_fig.xaxis.axis_label = r'$$\text{Prior Q (Pseudo-counts)} [ 10^\alpha ]$$'
bal_fig.yaxis.axis_label = r'$$\text{P(True) } [ \% ]$$'
bal_fig.legend.location = 'bottom_right'
show(bal_fig)
Decreasing the prior results in larger penalties for underspecified models, that is, models for which some \(q_i = 0\). The shift in the classification balance as the prior decreases (pairs that flip from True to False with respect to \(D_{KL}(P||Q) < D_{KL}(P||\mathbb{U})\)) includes catchment pairs where the proxy \(Q\) underspecifies the state frequencies of \(P\); the tolerance for underspecification is proportional to the observed (posterior) state frequency. Increasing the dictionary size preserves more of the original signal precision, and it also gives greater discriminatory power to the prior, as shown by the greater vertical spread in the plot above.
Next we’ll invert the plot to show how the binary label balance changes as a function of dictionary size (bitrate) for different priors.
cb_df = pd.DataFrame()
for posterior in posterior_columns:
    balances = []
    for bitrate in range(4, 13):
        if bitrate in [5, 7]:
            continue
        model_tests = result_dict[bitrate][posterior] < result_dict[bitrate][f'dkl_{concurrent}_uniform']
        # fraction of True labels; mean() is robust even when one class is absent
        balances.append(model_tests.mean())
    cb_df[posterior] = balances
cb_df.index = [2**e for e in [4, 6, 8, 9, 10, 11, 12]]
cb_df
bal_fig2 = figure(width=700, height=450, x_axis_type='log')
n = 0
for c in cb_df.columns:
    a = c.split('_')[-1].split('R')[0]
    bal_fig2.line(cb_df.index, cb_df[c], color=Category20[20][n],
                  line_width=2, legend_label=f'10^{a}')
    n += 1
bal_fig2.xaxis.axis_label = r'$$\text{Dictionary Size}$$'
bal_fig2.yaxis.axis_label = r'$$\text{P(True) } [ \% ]$$'
bal_fig2.add_layout(bal_fig2.legend[0], 'right')
bal_fig2.legend.location = 'bottom_right'
show(bal_fig2)
def add_attributes(attr_df, df_relations, attribute_cols):
    """
    Adds attributes from attr_df to df_relations based on the 'proxy' and
    'target' columns, using map for efficient lookups.

    Parameters:
        attr_df (pd.DataFrame): DataFrame with 'official_id' and attribute columns.
        df_relations (pd.DataFrame): DataFrame with 'proxy' and 'target' columns.
        attribute_cols (list of str): Attribute columns to add to df_relations.

    Returns:
        pd.DataFrame: Updated df_relations with added attribute columns.
    """
    # Create a lookup dictionary for each attribute
    attr_dicts = {col: attr_df.set_index('official_id')[col].to_dict() for col in attribute_cols}
    # Add target attributes
    for col in attribute_cols:
        df_relations[f'target_{col}'] = df_relations['target'].map(attr_dicts[col])
    # Add proxy attributes
    for col in attribute_cols:
        df_relations[f'proxy_{col}'] = df_relations['proxy'].map(attr_dicts[col])
    return df_relations
def predict_KLD_from_attributes(attr_df, target_col_base, holdout_pct, stations, nfolds,
                                results_folder, loss_function=None, partial_counts=False,
                                binary_test=False):
    training_stn_cv_sets, test_stn_sets = dpf.train_test_split_by_official_id(holdout_pct, stations, nfolds)
    all_test_results = {}
    for bitrate in [4, 6, 8, 9, 10, 11, 12]:
        t0 = time()
        print(f'bitrate = {bitrate}')
        fname = f"KL_results_{bitrate}bits_{revision_date}.csv"
        if partial_counts:
            fname = f"KL_results_{bitrate}bits_{revision_date}_partial_counts.csv"
        input_data_fpath = os.path.join(input_folder, fname)
        df = pd.read_csv(input_data_fpath, low_memory=False)
        df.dropna(subset=[target_col_base], inplace=True)
        t1 = time()
        print(f'    {t1 - t0:.2f}s to load input data')
        # add the attributes into the input dataset
        df = add_attributes(attr_df, df, all_attributes)
        if binary_test:
            # if DKL(P||Q) < DKL(P||U), then the model is a better compressor
            # of the target signal than a uniform distribution
            df['binary_target'] = df[target_col_base] < df[f'dkl_{concurrent}_uniform']
            pct_false = 1 - df['binary_target'].mean()
            # change target_col to the new binary target
            target_col = 'binary_target'
            print(f'The binary target variable balance is {100 * pct_false:.0f}% False and {100 * (1 - pct_false):.0f}% True')
        else:
            target_col = target_col_base
        all_test_results[bitrate] = {}
        input_attributes = []
        # add attribute groups successively, in the order named by attribute_set_names
        for attribute_set, set_name in zip([climate, land_cover, terrain, soil], attribute_set_names):
            print(f'    Processing {set_name} attribute set: {target_col}')
            input_attributes += attribute_set
            features = dpf.format_features(input_attributes)
            if binary_test:
                trial_df, test_df = dpf.run_binary_xgb_trials_custom_CV(
                    bitrate, set_name, features, target_col, df,
                    training_stn_cv_sets, test_stn_sets, n_optimization_rounds,
                    nfolds, n_boost_rounds, results_folder, loss=loss_function, eval_metric='error'
                )
                obs, pred = test_df['actual'].values, test_df['predicted'].values
                tn, fp, fn, tp = confusion_matrix(obs, pred).ravel()
                test_accuracy = (tp + tn) / (tp + fp + fn + tn)
                print(f'    held-out test accuracy: {test_accuracy:.2f}')
                # store the test set predictions and actuals
                all_test_results[bitrate][set_name] = {
                    'trials': trial_df, 'test_df': test_df,
                    'test_accuracy': test_accuracy}
            else:
                trial_df, test_df = dpf.run_xgb_trials_custom_CV(
                    bitrate, set_name, features, target_col, df,
                    training_stn_cv_sets, test_stn_sets, n_optimization_rounds,
                    nfolds, n_boost_rounds, results_folder, loss=loss_function
                )
                test_rmse = root_mean_squared_error(test_df['actual'], test_df['predicted'])
                test_mae = mean_absolute_error(test_df['actual'], test_df['predicted'])
                print(f'    Held-out test rmse: {test_rmse:.2f}, mae: {test_mae:.2f}')
                print('')
                # store the test set predictions and actuals
                all_test_results[bitrate][set_name] = {
                    'trials': trial_df, 'test_df': test_df,
                    'test_mae': test_mae, 'test_rmse': test_rmse}
    return all_test_results
## Run Binary Model
loss_function = 'binary:hinge'
binary_results_folder = os.path.join(BASE_DIR, 'data', 'kld_prediction_results_binary')
if not os.path.exists(binary_results_folder):
    os.makedirs(binary_results_folder)
for prior in priors_to_test:
    print(f'Starting tests based on 10^{prior} pseudo-count prior.')
    target_col = f'dkl_{concurrent}_post_{prior}R'
    test_results_fname = f'{target_col}_{prior}_prior_results_binary.npy'
    test_results_fpath = os.path.join(binary_results_folder, test_results_fname)
    if os.path.exists(test_results_fpath):
        all_results = np.load(test_results_fpath, allow_pickle=True).item()
    else:
        all_results = predict_KLD_from_attributes(attr_df, target_col, holdout_pct, station_ids,
                                                  nfolds, binary_results_folder,
                                                  loss_function=loss_function, binary_test=True)
        np.save(test_results_fpath, all_results)
Starting tests based on 10^-2 pseudo-count prior.
Starting tests based on 10^-1 pseudo-count prior.
Starting tests based on 10^0 pseudo-count prior.
Starting tests based on 10^1 pseudo-count prior.
Starting tests based on 10^2 pseudo-count prior.
## Run Regression Models
results_folder = os.path.join(BASE_DIR, 'data', 'kld_prediction_results')
if not os.path.exists(results_folder):
    os.makedirs(results_folder)
for prior in priors_to_test:
    target_col = f'dkl_{concurrent}_post_{prior}R'
    test_results_fname = f'{target_col}_{prior}_prior_results.npy'
    test_results_fpath = os.path.join(results_folder, test_results_fname)
    if os.path.exists(test_results_fpath):
        print('processed and loading: ', test_results_fname)
        all_test_results = np.load(test_results_fpath, allow_pickle=True).item()
    else:
        all_test_results = predict_KLD_from_attributes(attr_df, target_col, holdout_pct,
                                                       station_ids, nfolds, results_folder)
        np.save(test_results_fpath, all_test_results)
processed and loading: dkl_concurrent_post_-2R_-2_prior_results.npy
processed and loading: dkl_concurrent_post_-1R_-1_prior_results.npy
processed and loading: dkl_concurrent_post_0R_0_prior_results.npy
processed and loading: dkl_concurrent_post_1R_1_prior_results.npy
processed and loading: dkl_concurrent_post_2R_2_prior_results.npy
def load_result_by_prior(prior, binary=False):
    fname = f'dkl_{concurrent}_post_{prior}R_{prior}_prior_results.npy'
    if binary:
        fname = fname.replace('.npy', '_binary.npy')
        rf = binary_results_folder
    else:
        rf = results_folder
    fpath = os.path.join(rf, fname)
    return np.load(fpath, allow_pickle=True).item()
## Plot Results
### Plot Results of Binary Classification Test
layout_dict = {}
c = 0
bin_plots = []
for prior in priors_to_test:
    result = load_result_by_prior(prior, binary=True)
    title = f'10^{prior} Prior: Binary Test Results'
    fig = figure(title=title, x_range=attribute_set_names, toolbar_location='above')
    fig.yaxis.axis_label = 'Accuracy score (tp+tn)/(N)'
    fig.xaxis.axis_label = 'Attribute Group (additive)'
    for b, set_dict in result.items():
        y = [set_dict[e]['test_accuracy'] for e in attribute_set_names]
        source = ColumnDataSource({'x': attribute_set_names, 'y': y})
        fig.line('x', 'y', legend_label=f'{b}bits',
                 color=Category20[10][c], source=source, line_width=3)
        fig.legend.background_fill_alpha = 0.6
        c += 1
    bin_plots.append(fig)
    c = 0
layout = gridplot(bin_plots, ncols=3, width=350, height=300)
show(layout)
### Binary Results Discussion
Without additional context, the above plots might suggest that the prior has very little effect on the accuracy of the xgboost model, with perhaps slightly better performance at higher priors. Given the additional context provided by the classification balance, however, part of this apparent stability may reflect shifts in the class balance of the target variable rather than genuine changes in predictive skill.
The equal UAR model is based on mapping “observed” streamflow at one location to another. The dictionary size and the prior are two key assumptions that affect both the discriminant and loss functions upon which decisions and actions are ultimately based. The dictionary size controls the precision, or the confidence we have about the “true” value of observations. A perfect model will not be affected by the dictionary size, since however narrow we make intervals defining states, the model frequencies will equal the posterior (observed frequencies). By the same token, the choice of prior does not affect the discriminant function of a perfect model since it always predicts observed states. Our goal is to discriminate between imperfect models, to tune the dials of prior and quantization to bring into focus the true distinction between imperfect models without excessively introducing artifacts arising from incomplete information. Too small a dictionary filters out real information that the discriminant function would otherwise aim to exploit. Increasing the dictionary size leads to oversampling and increased sensitivity to outliers. The choice of prior can also be thought of as controlling contrast between samples in the discriminant function where predicted state frequencies are small compared to the assumed prior.
A low prior increases the penalty for “misaligned” distributions, that is, models that assign zero probability to some observed state(s). This characteristic is reflected in the class balance of \(Y\) as the prior is varied. Increasing the dictionary size has a magnifying effect on the loss function, since the “alignment” of narrower bins between \(P\) and \(Q\) is increasingly difficult to achieve by chance.
### Plot Results of \(D_{KL}\) Regression Test
layout_dict = {}
reg_plots_dict = {}
res_r2_dict = {}
for prior in priors_to_test:
    plots = []
    result = load_result_by_prior(prior, binary=False)
    reg_plots_dict[prior] = {}
    res_r2_dict[prior] = {}
    for b, set_dict in result.items():
        attribute_sets = list(set_dict.keys())
        y1 = [set_dict[e]['test_rmse'] for e in attribute_sets]
        y2 = [set_dict[e]['test_mae'] for e in attribute_sets]
        source = ColumnDataSource({'x': attribute_sets, 'y1': y1, 'y2': y2})
        title = f'{b} bits (Q(θ|D)∼Dirichlet(α=10^{prior}))'
        if len(plots) == 0:
            fig = figure(title=title, x_range=attribute_sets, toolbar_location='above')
        else:
            fig = figure(title=title, x_range=attribute_sets, y_range=plots[0].y_range, toolbar_location='above')
        fig.line('x', 'y1', legend_label='rmse', color='green', source=source, line_width=3)
        fig.line('x', 'y2', legend_label='mae', color='dodgerblue', source=source, line_width=3)
        fig.legend.background_fill_alpha = 0.6
        fig.yaxis.axis_label = 'Error'
        fig.xaxis.axis_label = 'Attribute Group (additive)'
        result_df = pd.DataFrame({'set': attribute_sets, 'rmse': y1, 'mae': y2})
        best_rmse_set = result_df.loc[result_df['rmse'].idxmin(), 'set']
        best_result = set_dict[best_rmse_set]['test_df']
        xx, yy = best_result['actual'], best_result['predicted']
        slope, intercept, r, p, se = linregress(xx, yy)
        sfig = figure(title=f'Test: {b} bits best model {best_rmse_set} (N={len(best_result)})', toolbar_location='above')
        sfig.scatter(xx, yy, size=1, alpha=0.6)
        xpred = np.linspace(min(xx), max(xx), 100)
        ybf = [slope * e + intercept for e in xpred]
        sfig.line(xpred, ybf, color='red', line_width=3, line_dash='dashed', legend_label=f'R²={r**2:.2f}')
        # plot a 1:1 line
        sfig.line([min(yy), max(yy)], [min(yy), max(yy)], color='black', line_dash='dotted',
                  line_width=2, legend_label='1:1')
        sfig.xaxis.axis_label = r'Actual $$D_{KL}$$ [bits/sample]'
        sfig.yaxis.axis_label = r'Predicted $$D_{KL}$$ [bits/sample]'
        sfig.legend.location = 'top_left'
        reg_plots_dict[prior][b] = sfig
        res_r2_dict[prior][b] = r**2
        plots.append(fig)
        plots.append(sfig)
    layout_dict[prior] = gridplot(plots, ncols=2, width=350, height=300)
show(layout_dict[-2])
show(layout_dict[-1])
show(layout_dict[0])
show(layout_dict[1])
show(layout_dict[2])
col_figs = []
for prior in [-2, -1, 0, 1, 2]:
    col = column([reg_plots_dict[prior][b] for b in [4, 6, 8, 9, 10, 11, 12]])
    col_figs.append(col)
reg_layout = row(col_figs)
show(reg_layout)
from bokeh.transform import linear_cmap
from bokeh.models import ColorBar, ColumnDataSource
from bokeh.layouts import gridplot
from bokeh.palettes import Viridis256, gray, magma, Category20
# Convert the nested dict into a DataFrame
df = pd.DataFrame(res_r2_dict).T # Transpose to get priors as columns
df.index.name = 'Prior'
df.columns.name = 'Bitrate'
# Melt the DataFrame to a long format
df_melted = df.reset_index().melt(id_vars='Prior', var_name='Bitrate', value_name='Value')
# Ensure the Bitrate values are ordered correctly (increasing order)
df_melted['Bitrate'] = pd.Categorical(df_melted['Bitrate'], categories=sorted(df_melted['Bitrate'].unique(), reverse=False), ordered=True)
# Create a Bokeh ColumnDataSource
source = ColumnDataSource(df_melted)
# Create a figure for the heatmap
p = figure(title="KL divergence from attributes: R² of test set by Prior and Bitrate",width=600, height=500,
tools="hover", tooltips=[('Value', '@Value')], toolbar_location=None)
# Create a color mapper
mapper = linear_cmap(field_name='Value', palette=magma(256), low=df_melted.Value.min(), high=df_melted.Value.max())
# Add rectangles to the plot
p.rect(x="Prior", y="Bitrate", width=1, height=1, source=source,
line_color=None, fill_color=mapper)
# Add color bar
color_bar = ColorBar(color_mapper=mapper['transform'], width=8, location=(0,0))
p.add_layout(color_bar, 'right')
# Format plot
p.axis.axis_line_color = None
p.axis.major_tick_line_color = None
p.xaxis.axis_label = r'$$Q(θ|D)∼\text{Dirichlet}(\alpha = 10^{a})$$'
p.yaxis.axis_label = r'$$\text{Quantization Bitrate (dictionary size)}$$'
p.axis.major_label_text_font_size = "10pt"
p.axis.major_label_standoff = 0
p.xaxis.major_label_orientation = 1.0
# Output the plot to an HTML file and display it
# output_file("heatmap.html")
show(p)
## Discussion
Since the KL divergence \(D_{KL}(P||Q) = \sum_{i=1}^{2^b} p_i\log\left(\frac{p_i}{q_i}\right) \rightarrow +\infty\) when any \(q_i \rightarrow 0\) with \(p_i > 0\), the simulated \(Q\) is treated as a posterior distribution by assuming a uniform (Dirichlet) prior, i.e. a constant pseudo-count \(\alpha\) added to every state count. The prior addresses the (commonly occurring) case where \(q_i = 0\), that is, where the model does not predict an observed state \(i\). In this experiment we tested a wide range of priors on an exponential scale, \(\alpha = 10^a, a \in \{-2, -1, 0, 1, 2\}\).
The scale of the pseudo-count can be interpreted as a strength of belief in the model. Small \(a\) represents strong belief that the model produces a distribution \(Q\) that is representative of the “true” (observed posterior) distribution \(P\), and for a fixed \(p_i\) the effect of a decreasing \(a\) on the discriminant function \(D_{KL}\) yields a stronger penalty for a model that predicts an observed state with 0 probability. Loss functions penalize overconfidence in incorrect predictions, and a prediction of 0 probability of a state which is actually observed should perhaps be thought of as confidence in an incorrect prediction and penalized as such. A large \(a\) represents weak belief that the model produces a distribution \(Q\) that is representative of \(P\), since \(Q\) approaches the uniform distribution \(\mathbb{U}\) as \(a\) increases.
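The effect of the pseudo-count scale on the discriminant can be sketched numerically; the toy state counts and helper names below are assumptions for illustration.

```python
# Sketch: uniform pseudo-counts 10^a are added to the model's state counts
# before normalizing, so no q_i is exactly zero.
import numpy as np

def dkl_bits(p, q):
    """KL divergence in bits, restricted to observed states (p_i > 0)."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def smoothed_q(counts, a):
    """Posterior mean of Q under a uniform Dirichlet(10^a) prior."""
    c = np.asarray(counts, dtype=float) + 10.0 ** a
    return c / c.sum()

p = np.array([0.5, 0.3, 0.2, 0.0])  # observed (posterior) state frequencies
q_counts = [60, 40, 0, 0]           # the model never visits states 2 and 3
for a in (-2, 0, 2):
    print(a, dkl_bits(p, smoothed_q(q_counts, a)))
# Small a: q_2 stays near zero and the divergence penalty is large;
# large a: Q flattens toward uniform and the penalty shrinks.
```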
Adding pseudo-counts has the effect of diluting the signal that the gradient boosting model can exploit in minimizing prediction error. Analogously, varying the bitrate, i.e. the size of the dictionary used to quantize continuous streamflow into discrete states, also adds quantization noise, since the original streamflow signals are stored at three-decimal precision and are quantized into as few as 4 bits (a 16-symbol dictionary) and as many as 12 bits (a 4096-symbol dictionary). The range of dictionary sizes is set to cover the expected range of rating curve uncertainty, which is generally considered multiplicative and expressed as a percentage of the observed value.
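The quantization step can be sketched as follows, assuming equal-width binning over the observed range (the binning scheme and the synthetic signal are illustrative assumptions, not the notebook's own implementation):

```python
# Sketch: map a continuous UAR signal onto a dictionary of 2**bits states.
import numpy as np

def quantize(series, bits):
    """Assign each value to one of 2**bits equal-width bins (states)."""
    edges = np.linspace(series.min(), series.max(), 2 ** bits + 1)
    # digitizing against the interior edges yields states 0 .. 2**bits - 1
    return np.digitize(series, edges[1:-1])

rng = np.random.default_rng(42)
uar = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)  # synthetic UAR signal
for bits in (4, 8, 12):
    states = quantize(uar, bits)
    print(bits, len(np.unique(states)))  # occupied states out of 2**bits
```

For a heavy-tailed signal like this, many of the 4096 states at 12 bits are empty, which illustrates how a larger dictionary increases sensitivity to outliers.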
As shown by the results, priors representing the addition of \(10^1\) to \(10^2\) pseudo-counts diminish the performance of the gradient boosted decision tree model, regardless of the dictionary size (the number of possible values provided by the quantization). Heavily penalizing unpredicted states does not have as great an impact as anticipated, perhaps because the corresponding \(p_i\) values are also small.